Relevance Judgements for Assessing Recall
Abstract
Recall and Precision have become the principal measures of the effectiveness of information retrieval systems. Inherent in these measures of performance is the idea of a relevant document. Although recall and precision are easily and unambiguously defined, selecting the documents relevant to a query has long been recognised as problematic. To compare the performance of different systems, standard collections of documents, queries, and relevance judgements have been used. Unfortunately the standard collections, such as SMART and TREC, have locked in a particular approach to relevance and this has affected subsequent research. Two styles of information need are distinguished, high precision and high recall, and a method of forming relevance judgements suitable for each is described. The issues are illustrated by comparing two retrieval systems, keyword retrieval and semantic signatures, on different sets of relevance judgements.

Introduction

Four decades of testing information retrieval systems suggests that automated keyword search is a very effective means of finding relevant material. This outcome is understandable when a query is expressed using words that are specific to the query's topic, but many have felt there must be room for improvement in the general case. Recent attempts to improve the effectiveness of information retrieval systems include adding natural language processing techniques (Fagan, 1987; Smeaton, 1987; Gallant et al., 1992; Wallis, 1993), broadening the query using automatic and general purpose thesauri (Crouch, 1990; Thom and Wallis, 1992), document clustering techniques (El-Hamdouchi and Willett, 1989), and other statistical methods (Deerwester et al., 1990). The measured performance of these techniques has not been encouraging. In this paper we argue that the reason for this result may lie in biases associated with the assessment method rather than with the information retrieval techniques being proposed.

The emphasis in information retrieval research has been, it seems, on users who want to find just a few relevant documents quickly. Once such a user has found the required information, he or she stops searching. In the extreme case, these users want a system that finds a single relevant document and as few non-relevant items as possible. In other words, they need a system that emphasises high precision. In many situations, however, a user must be fairly confident that a literature search has found all relevant material; such a user requires high recall. Su (1994) has found that users are more concerned with absolute recall than with precision. Her results were based on information requests by users in an academic environment, but her findings are applicable to other users. For instance, when lodging a new patent at the Patent Office it is necessary to retrieve all relevant, or partially relevant, material. Finding precedent cases in legal work, and intelligence gathering, are other situations in which high recall is desirable.

Some might argue that there is no need for high recall retrieval systems because existing tools are good enough. Cleverdon (1974) has suggested that, because there is significant redundancy in the content of documents, all the relevant information on a topic will be found in only a quarter of the relevant documents.
However, a user interested in high recall will need to find much more than a quarter of the relevant documents, because it is unlikely that the first 25% of the relevant documents an information retrieval system finds will be exactly those that contain all the relevant information. Users who need high recall information retrieval get by with existing high precision tools, but there is a need for systems that emphasise this side of the problem.

Apart from practical benefits, improving the recall of information retrieval systems is an interesting research problem. Swanson (1988) has said there are conceptual problems in information retrieval that have been largely ignored. One of the "postulates of impotence" he proposes is that human judgement can bring something to information retrieval that computers cannot, because computers cannot understand text. Our work on semantic signatures goes part way toward computer understanding and shows where understanding can contribute to the information retrieval process. In this paper we show that the high recall part of the information retrieval problem may have been ignored, not because people have found the problem esoteric or uninteresting, but because they have not had the tools to effectively test ideas.

This paper is structured as follows. In the section on relevance judgements and information retrieval we look at what it means for a document to be relevant and how relevance has been assessed in some existing test collections. In that section we also define recall and precision, and describe the trade-off between them. In the section on information retrieval and assessing recall we describe how information retrieval measurement is usually biased towards high precision rather than high recall. We also discuss the limitations of using the TREC collection and why this collection does not solve the problem of assessing recall for information retrieval systems that support ad hoc queries. In the section comparing two systems for recall we describe information retrieval based on semantic signatures and compare it with standard keyword retrieval. The relevance judgements of one popular test collection are re-evaluated, and we show that, on a small test set of queries, whereas the existing relevance judgements support keyword retrieval, semantic signatures have a significant advantage using the revised set of relevance judgements.

Relevance judgements and information retrieval

Information retrieval systems are usually considered to contain a static set of documents, from which a user wants to extract those documents she or he will find interesting. Some documents are relevant to the user, and others are not. The effectiveness of any query can be measured by computing recall and precision figures based on a list of documents that are considered to be relevant. For a given query and information retrieval system, recall measures the number of relevant documents retrieved as a proportion of all relevant documents, and precision measures the number of relevant documents retrieved as a proportion of all documents retrieved. The perfect system would return a set of documents to the user containing all the relevant documents, and only the relevant documents. Such a query would have a recall of 1.0 and a precision of 1.0.
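Stated directly (a minimal sketch, with invented document identifiers rather than any real collection), both measures are simple set ratios once the list of relevant documents is known:

    # Recall and precision for a single query; identifiers are invented.
    relevant  = {"D3", "D8", "D15", "D21"}    # all documents judged relevant to the query
    retrieved = {"D3", "D8", "D40", "D77"}    # documents returned by the system

    hits = relevant & retrieved
    recall = len(hits) / len(relevant)        # proportion of the relevant documents retrieved
    precision = len(hits) / len(retrieved)    # proportion of the retrieved documents that are relevant
    print(f"recall = {recall:.2f}, precision = {precision:.2f}")   # recall = 0.50, precision = 0.50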
Critical to computing recall and precision figures is determining lists of relevant documents for given queries. Intuitively, a relevant document is one that satisfies some requirement in the user's head. Saracevic (1975) describes this approach to relevance as "a primitive 'y'know' concept, as is information for which we hardly need a definition". This concept of relevance was the basis of early relevance judgements, in which the person who formulated the query chose the relevant documents. Only one person (the requester) was asked to collect the judgements for each request, and dichotomous assessments were made to declare each document either relevant or not (Lesk and Salton, 1969).

The position that the requester knows what she or he wants, and is therefore the person who knows what is relevant, is entirely reasonable, but it is not necessarily a good means of assessing the effectiveness of a retrieval system. What a user wants may not be what she or he describes. There are several approaches taken to this problem. The simplest is to ignore it; recall and precision can only be used comparatively anyway, and Lesk and Salton (1969) have shown that user judgements can be used effectively to compare the performance of different systems. At the other extreme, many have tried to find a formal definition of relevance that would allow us to say definitively whether a text was relevant or not. A good example of this approach is Cooper's (1971) definition of logical relevance. More recently the TIPSTER and TREC projects employ specially trained relevance assessors (Harman, 1992) who, it is assumed, can make consistent and accurate assessments of relevance.

Naturally it is optimistic to expect a third party, be it a machine or an information officer, to find what the author of a query wants rather than what the author actually requests. Having a machine retrieve all and only the documents the user considers relevant from a badly worded query would tell us more about the user's mind than about the ability of the machine to convert queries into useful sets of documents. In many cases users do not express their desires clearly, and the same query can be given by two users with significantly different meanings. Consider the four queries shown in Figure 1, which are selected from the Communications of the ACM (CACM) test set of 64 queries and 3204 documents provided with the SMART information retrieval system (Buckley et al., 1988). Query 19 is quite ambiguous. Does the author of this request want examples of parallel algorithms, or information about parallel algorithms? And does the author want everything on parallel algorithms, or simply a few examples?

    Query 15: Find all discussions of horizontal microcode optimisation with special emphasis on optimisation of loops and global optimisation.
    Query 19: Parallel algorithms
    Query 39: What does type compatibility mean in languages that allow programmer defined types? (You might want to restrict this to "extensible" languages that allow definition of abstract data types or programmer-supplied definitions of operators like *, +.)
    Query 64: List all articles on EL1 and ECL (EL1 may be given as EL/1; I don't remember how they did it).
    Figure 1: Four queries from the CACM test collection.

Figure 2 illustrates the difference between user judgements and those of a third party.

[Figure 2: Users, queries, and relevance. The original figure lays out, in two columns headed "User judgements" and "Third party judgements", titles relating to the query "Parallel algorithms": Fast Parallel Sorting Algorithms; Comments on A Paper on Parallel Processing; Optimal Code for Serial and Parallel Computation; Some thoughts on Parallel Processing; How can several processors be used for sorting?]

The problem of judging relevance has been around for a long time and the idea of third party judges is not new. There can be significant disagreement, not only between a third party and the author of the query, but also amongst third party judges working on the same query.
In 1953 a large scale experiment was designed to compare the retrieval effectiveness of two information retrieval systems developed by different institutions. Two sets of relevance judgements were provided by the separate groups for a test collection consisting of 15,000 technical documents and 98 questions. There were 1390 documents that the two groups agreed were relevant, and another 1577 that one group, but not both, thought relevant: "a colossal disagreement that was never resolved" (Swanson, 1988). If relevance judgements are so unstable, how can they be used as the basis for objective measurement of information retrieval system performance? Cuadra and Katter think they cannot, and say that

    the first and most obvious implication is that one cannot legitimately view "precision" and "recall" scores as precise and stable bases for comparison between systems or system components, unless [appropriate controls are introduced]. (quoted in Lesk and Salton, 1969)

In spite of these problems, Lesk and Salton (1969) have argued that, for the purposes of assessing information retrieval systems, it does not matter whether the relevance assessments of the author, or a third party, are used. They found that although their volunteers did indeed give inconsistent relevance judgements, these judgements could still be used to compare the effectiveness of information retrieval systems. They describe experiments in which the relevance judgements of the author of a query and those of a second person were compared over 48 queries. On average there was about a 30% overlap between the two sets of judgements (ranging from an average of about 10% for one author to 53% for another; the actual queries are not provided). Similar differences between user judgements and second person judgements have been reported elsewhere in the literature (Janes, 1994). Lesk and Salton (1969) show that as long as relevance judgements are used to compare systems, it does not seem to matter which person's relevance judgements are used, and a technique for information retrieval that performs well on one set of judgements will perform well on others as well. The explanation they provide for this phenomenon is that there is substantial overlap on the set of documents that are "most certainly relevant to each query". They conclude:

    ... that although there may be considerable difference in the document sets termed relevant by different judges, there is in fact a considerable amount of agreement for those documents which appear most similar to the queries and which are retrieved early in the search process (assuming retrieval is in decreasing correlation order with the queries). Since it is precisely these documents which largely determine retrieval performance it is not surprising they find that the evaluation output is substantially invariant for the different sets of relevance judgements. (Lesk and Salton, 1969, p. 355)

Because everyone agrees about which documents should be found first, systems aimed at the high precision end of the information retrieval problem can be tested (comparatively) on anyone's judgements.
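To make the reported degree of overlap concrete (a sketch of our own, with invented document identifiers, not data from Lesk and Salton), the overlap between two judges' relevance sets for a query can be measured as the size of their intersection relative to their union:

    # Two hypothetical relevance sets for the same query.
    author_judgements = {"D101", "D204", "D310", "D422", "D515"}
    second_judgements = {"D204", "D310", "D600", "D731"}

    # Overlap as intersection over union (other definitions are possible).
    common = author_judgements & second_judgements
    either = author_judgements | second_judgements
    print(f"overlap = {len(common) / len(either):.0%}")   # overlap = 29%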
In the next section we assume that third party judgements are an acceptable source of relevance assessments, but find that accepting Lesk and Salton's position introduces a bias in the test procedure toward systems which emphasise precision. This bias is acknowledged by Lesk and Salton, but it seems the presence of this bias has been forgotten over the last twenty years. A second bias, again acknowledged by Lesk and Salton, is introduced by the way information retrieval systems which rank documents are assessed.

An information retrieval mechanism employing ranking provides the user with a list of documents that the system considers relevant. The list is presented to the user in decreasing order of relevance, according to the system. Ranking systems can deal with what are often called "noise words". This feature allows users to enter their query in natural English. Consider phrases in the four queries shown in Figure 1 such as "I don't remember" in CACM query 64, "Find all" in query 15, and "what does" in query 39. As far as keyword retrieval is concerned, these add no useful information to the query (and would not be used as part of a boolean query), but they do not seem to hinder the performance of ranking systems that incorporate stop-lists and term weighting. Such phrases can thus be left in and enable users to express their desires in a form that comes naturally.

Given an information retrieval system that ranks documents, performance can be assessed using recall and precision figures by comparing the list of documents returned with the set of documents judged as relevant in the test collection. For instance, a keyword retrieval system may return documents in the following order for query 64.

    D2651 D1307 D2513 D793 ...

According to the relevance judgements provided, the document numbered D2651 is relevant, and indeed it is the only relevant document. The system has done well and the first document the user looks at will be useful. When a ranking information retrieval system does not perform perfectly, however, assessment is more difficult. Consider a ranking system executing query 15 from the CACM collection. It has 10 relevant documents, and documents are returned by the system in the following order.

    D820 D2835 D3080 D2685 D307 D2929 D2616 D2344 D3054 D1466 D113 D1461 D658 D1231 ...

Of these, only the documents at positions 4 and 14 are relevant. The precision at the 10% level of recall is calculated as if the system had stopped after the fourth document, giving 25% precision. The next relevant document occurs at position 14, and the next at position 20. We plot the precision at 10% increments of recall as the broken line in Figure 3. The performance of information retrieval systems is fairly uneven, and so results are usually presented as the average precision at fixed levels of recall across all queries in a collection.[1]

Calculating a precision value for an information retrieval system is relatively simple: the user looks at the texts and decides which texts she or he actually wants. The precision is the number the user wants, divided by the number the user has looked at. Finding recall is more difficult in that the calculation requires knowledge of all relevant documents in the collection.

[1] Variations on this are sometimes used; for example, one of the evaluation figures produced for the TREC experiments interpolates, for each query, high precision figures to all lower levels of recall. The evaluation procedure trec_eval() is available from the ftp site ftp.cs.cornell.edu with the SMART information retrieval system.

[Figure 3: Precision at various levels of recall for Query 15. Axes: recall (20%-100%) against precision (20%-100%); two curves are plotted, Query 15 (broken line) and the average over 54 queries (solid line, 10 point average 20.0%).]
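The broken line in Figure 3 can be read off the ranked list mechanically; a small sketch, using the ranks quoted above for query 15, which has 10 relevant documents in total:

    # Precision at the recall levels reached so far for CACM query 15.
    # The first relevant documents appear at ranks 4, 14 and 20 (from the text above).
    relevant_ranks = [4, 14, 20]
    total_relevant = 10

    for i, rank in enumerate(relevant_ranks, start=1):
        recall = i / total_relevant
        precision = i / rank          # i relevant documents seen in the top `rank`
        print(f"recall {recall:.0%}: precision {precision:.1%}")
    # recall 10%: precision 25.0%
    # recall 20%: precision 14.3%
    # recall 30%: precision 15.0%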
Having the user exhaustively search the collection to find all relevant material is an expensive task in document collections of significant size. Methods for predicting the number of remaining relevant documents using statistical methods have not been successful (Miller, 1971). An alternative to exhaustively searching the document collection is what is known as the pooled method. Using this test method, two or more information retrieval systems are used on the same query with the same document collection. The top N documents from each system are pooled, and judged for relevance. The judge does not know which system found which documents. The relevance judgements can then be used to compare the relative recall of the various systems. This is the approach taken in the TIPSTER project and the related TREC project (Harman, 1992). In some cases researchers have claimed that, by using a range of different information retrieval systems, the pooled method has found "the vast majority of relevant items" (Salton et al., 1983). If this assumption holds, then the collection and relevance judgements can be used to assess other systems without having to reassess retrieved documents.

Often the performance of a system is summarised with a single figure that is the average precision at all levels of recall. This provides a crude but easy way to compare the performance of different information retrieval methods. Consider the performance of the information retrieval system plotted as the solid line in Figure 3.

    recall:    10%  20%  30%  40%  50%  60%  70%  80%  90%  100%
    precision: .44  .36  .30  .24  .18  .15  .10  .08  .06  .06

When these are averaged, the performance of this particular system can be summarised with the single figure, 20.0% average precision on a 10 point scale.

Information retrieval and assessing recall

Unfortunately, both averaging precision figures and relying on a single set of relevance judgements introduce a bias towards high precision. First consider our keyword system ranking documents for query 15. The top fifty documents in the ranked list are represented below, from left to right, with dots representing non-relevant documents and hashes representing relevant documents.

    ...#.........#.....#.#............................

The average precision for the system on this query is 9%. Now consider two hypothetical systems. The first system finds a high ranking relevant document sooner (that is, gives it an even higher rank), so improving precision. Let us assume it moves the first relevant document in the list from position 4 to position 1:

    #............#.....#.#............................

This improvement almost doubles the average precision of the system to 16.2%. The second system improves recall by making low ranking relevant documents, which are unlikely to be seen by the user, more accessible. Let us assume it finds all relevant documents in the top fifty:

    ...#.........#.....#.#......................######

This improvement once again almost doubles the average precision of the system to 16.1%.
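The single-figure summary is simply the mean of the precision values at the ten recall levels; averaging the figures quoted above for the solid line in Figure 3 is enough to recover it (a sketch; the 9%, 16.2% and 16.1% figures for the query 15 variants are presumably produced the same way, but over the full ranking of the collection rather than just the fifty documents shown, so they cannot be reproduced from the patterns alone):

    # Ten-point average precision: the mean of precision at recall 10%, ..., 100%.
    precision_at_recall = [0.44, 0.36, 0.30, 0.24, 0.18, 0.15, 0.10, 0.08, 0.06, 0.06]

    ten_point_average = sum(precision_at_recall) / len(precision_at_recall)
    print(f"{ten_point_average:.1%}")   # 19.7% from these rounded figures; reported as 20.0%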
To a user after the first relevant document, the inconvenience of having to scan three irrelevant titles is small; users wanting many more relevant documents are greatly inconvenienced if they have to scan hundreds or thousands of titles to find them. The hypothetical high recall system is far more use to the author of the query who wanted to "find all" material. The first alternative is of questionable benefit even when the user is only after the first relevant document, and the difference in utility of the two mechanisms is not shown when average precision is used to summarise system performance. Those interested in improving the recall of information retrieval systems must compare precision at individual levels of recall and not be misguided by an overall average. Taking the average precision over various levels of recall has a strong bias toward systems that find a few relevant texts early, but it is a bias that can be recognised and controlled.

Unfortunately there is a similar bias built into the way relevant documents have been judged for some test collections. Lesk and Salton's work, discussed in the previous section, argues that any set of relevance judgements can be used for assessment because all include those documents which are "certainly relevant". They conclude that the effectiveness of information retrieval systems can be compared on such judgements. Note that such relevance judgements give no indication of the effectiveness of an information retrieval system in an absolute sense, and a question of some import would seem to be just how much room there is for improvement in information retrieval systems. Consider a retrieval system that found between 10% and 50% of the documents judged relevant. Such a system appears to be doing as well as any human judge from the Lesk and Salton results and, given this is a reasonable figure for a boolean system to achieve, it would seem there is little point in further research.

Here we suggest that a less comparative method of assessment could be had by explicitly finding all and only the documents that are certainly relevant, and using those for assessment purposes. A method suggested by their work is to take the intersection of several sets of judgements, and use that as the set of relevant documents for a query. Such a test collection would give some idea of the absolute value of a text retrieval system for those users who wish to find something quickly on their topic of interest.

But what about users who want to find all relevant documents? Many researchers in the area have had users provide relevance judgements at various levels (Sparck Jones and van Rijsbergen, 1976). However this is not a simple thing to express to judges, nor does there seem to be any reason to choose two levels of relevance over, say, five. Frei (1991) has proposed a test mechanism that takes this view to its extreme and compares two ranked lists of documents. We advocate instead that the "perfect" information retrieval system aimed at high recall would retrieve all documents considered relevant by any judge. The relevant documents for testing such a system would be the union of all the relevance sets. With this view of relevance it is not possible for an information retrieval system to achieve perfect retrieval (100% precision at 100% recall) because different users have different ideas as to what is relevant. Without looking into the minds of each user, the ideal information retrieval system would retrieve all documents certainly relevant (those in the intersection test set), followed by all those thought relevant by at least one judge (those in the union), followed by the rest. The performance of such a system would be extremely difficult to assess in absolute terms, and so we advocate using two distinct relevance sets for testing purposes: the intersection of the relevance judgements for those research projects focusing on high precision, and the union of these sets for those interested in high recall. The ultimate system would perform well on both sets of relevance judgements.
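The two proposed test sets fall directly out of the individual judges' relevance sets; a minimal sketch with invented judgements for a single query:

    # Hypothetical relevance sets from three judges for one query.
    judgements = [
        {"D101", "D204", "D310"},          # judge 1
        {"D101", "D204", "D422", "D515"},  # judge 2
        {"D204", "D310", "D600"},          # judge 3
    ]

    # High-precision test set: documents everyone agrees are relevant.
    certainly_relevant = set.intersection(*judgements)   # {'D204'}

    # High-recall test set: documents at least one judge thought relevant.
    possibly_relevant = set.union(*judgements)            # six documents in total

    print(certainly_relevant, possibly_relevant)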
Lesk and Salton's conclusions were incorporated in the creation of test collections such as CACM, and defended the status of others. Anyone using these collections must keep in mind the assumptions behind their creation. But some might question the importance of this given the recent TIPSTER initiative. TREC has several appealing features, including significantly more documents, unambiguous queries, and relevance assessment done by professionals. We believe however that these advantages introduce problems that must be considered.

The TREC project grew out of an interest in routers and filters sponsored by DARPA, the United States Defense Advanced Research Projects Agency. Although there are considerable similarities between such mechanisms and information retrieval systems, the people using filters can have considerably different interests from those wanting an information retrieval system. We have already suggested that many doing information gathering are, unlike Lesk and Salton, going to be interested in finding more than just the first few relevant documents. The TREC project's professional relevance assessors, in combination with the extended "topics" rather than queries, give the meaning of "relevance" a conciseness unattainable with test collections assembled with volunteer judges. Although the TREC project provides an environment with fewer disputes about what is relevant and what is not, this is not a feature of the average library catalogue search. Ad hoc queries are not only short (a point that has recently been addressed by the TREC organisers with the use of the summarization field but the same relevance judgements), they are also, we claim, inherently ambiguous. "Parallel algorithms" is a genuine ad hoc query, no matter how scientifically unappealing it may be. Proper evaluation of ad hoc queries can only be carried out with multiple judgements and naive users.

The size of the TREC document collection also introduces problems. Creating a test collection with exhaustive relevance judgements is an extremely expensive process and is not feasible with a collection as big as TREC. The alternative is to use the pooled method to compare competing systems. This is expensive in that each new comparison requires more documents to be judged, and thus the number of groups participating in the TREC competition must be limited. Researchers who are not able to participate might use the TREC documents and queries, and the released relevance judgements. A problem with this approach to testing is the status of unjudged documents. There is a tacit assumption that any documents that have not been found by the contractors and participants are not going to be relevant.
If one believes that keyword retrieval is effective at finding significant numbers of the actual relevant documents, then, with enough participants, all the relevant material will be found. But this line of argument relies on one's faith in keyword retrieval being good for high recall. The more divergent new systems become from the other systems participating, the more likely it is that the new system will be finding relevant, unjudged documents. Thus, care must be taken when using the released relevance judgements to test novel systems.

Comparing two systems for recall

The remainder of this paper illustrates the problems of measuring high recall by comparing two retrieval mechanisms over a limited set of queries using a new set of relevance judgements designed for assessing high recall. In this section we describe a new information retrieval mechanism that we expect may give better recall than keyword retrieval. The semantic signature mechanism is based on the assumption that relevant documents will partially paraphrase the user's query. There is thus room to improve information retrieval systems by having them recognise when the same idea is being expressed in different words.

Natural languages such as English allow great diversity in the way ideas are expressed. Syntactic variations can be dealt with formally by, for instance, converting all texts to their active form. In information retrieval, a less formal and quicker mechanism is used: all word order is removed and texts are treated as sets of words. This works, we argue, because the semantic structure of a text is primarily carried by the lexical preferences associated with the words themselves. As an extreme example of this process, there is only one way to assemble a meaningful sentence from the three words "police", "door", and "sledge-hammer". Variations in word meanings have been tackled using thesaurus-like mechanisms. This approach however does not capture paraphrases in which single words are replaced with longer texts. The semantic signature mechanism attempts to deal with variations in the meaning/text mapping at the text level by constraining the vocabulary in which ideas are expressed. Keyword retrieval of documents written in a restricted language, by queries written in the same restricted language, is likely to significantly reduce the number of misses caused by variation in language use.

The vocabulary in which semantic signatures attempt to capture ideas is that used in the definitions in the Longman Dictionary of Contemporary English (LDOCE). The definitions in LDOCE have been written from a restricted vocabulary of approximately 2300 words. The technique we use relies on the assumption that the definitions in LDOCE have been written using Leibniz's substitutability criterion for a good definition: each word in the text of documents and the query is replaced with the appropriate definition from LDOCE. If the definitions are good, the meaning of the text remains unchanged.[2] The aim is to imitate a system that paraphrases both documents and queries in a language with a restricted vocabulary, and then does conventional information retrieval on the new representations of the documents' (and queries') meaning.

In terms of the vector-space model, conventional keyword retrieval places documents and queries in N-dimensional space, where N is the number of unique words in the document collection.
When the cosine similarity measure is used, the similarity (relevance) of a document to a query is the cosine of the angle between the appropriate vectors. The semantic signature mechanism simply uses a smaller set of features, with LDOCE as the required mapping function. When a keyword information retrieval mechanism is used, the features are the words used to describe the concept. They need not be, and several attempts have been made to map the words to better features before placing the concept in the concept space (Deerwester et al., 1990; Gallant et al., 1992; Wallis, 1993). The major problems with implementing this technique are in devising a reasonable set of better features, and in devising a mapping from the words appearing in the text of documents and queries to a suitable feature-based representation. The semantic signature mechanism uses LDOCE to solve both these problems.

Mapping words to dictionary entries is not straightforward. Two filters are used: one to remove affixes, and another to select the required sense of the word. Both these procedures are known to degrade a retrieval system's performance[3] and so, to maintain a level playing field, the keyword mechanism we use has the affixes removed and words are tagged with their homograph number. This process is compared with a conventional stemming filter in Figure 4.

[2] Leibniz actually wanted the truth of the text to remain the same as before the substitution.
[3] Recent tests by the authors on TREC suggest otherwise.

[Figure 4: Text processing requirements for semantic signature tests. Components shown: Raw Text, Stemmer, Morpher, Sense Selection, Signature Generator, Semantic Signatures, Keyword Test, Conventional Text Retrieval.]

The keyword test takes the original text and replaces each word that appears in LDOCE with its sense-selected root word. As an example, query 25 from the CACM test set asks for "Performance evaluation and modelling of computer systems". This text is passed through the above filters to become the set of sense-selected words: {performance2, evaluate1, model8, computer1, system4}. There is now a one-to-one correspondence between the terms in the text and definitions. When the words are replaced with the appropriate definition for each term, the text becomes the set of "primitives":

    act action before character music perform piece play public trick calculate degree value model calculate electric information machine make speed store body system usual way work

Note that although the textual representation of query 25 in the semantic signature format is large, the vocabulary is highly constrained and so a practical implementation of this retrieval mechanism could be quite efficient.
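A toy rendering of the idea, under strong simplifying assumptions: the dictionary below is an invented stand-in for LDOCE sense-selected definitions (the real mechanism also involves affix stripping and homograph selection), and similarity is the cosine measure described above.

    import math
    from collections import Counter

    # Invented stand-in for LDOCE: each sense-selected term maps to the words of
    # its definition, drawn from the restricted defining vocabulary.
    toy_ldoce = {
        "performance2": {"act", "piece", "play", "public"},
        "evaluate1":    {"calculate", "degree", "value"},
        "computer1":    {"electric", "machine", "information", "store", "speed"},
        "system4":      {"body", "way", "work", "usual"},
    }

    def signature(terms):
        """Replace each term with its definition words (its 'primitives')."""
        primitives = Counter()
        for term in terms:
            primitives.update(toy_ldoce.get(term, {term}))
        return primitives

    def cosine(a, b):
        """Cosine of the angle between two bag-of-words vectors."""
        dot = sum(count * b[word] for word, count in a.items())
        norm = math.sqrt(sum(c * c for c in a.values())) * math.sqrt(sum(c * c for c in b.values()))
        return dot / norm if norm else 0.0

    query = signature(["performance2", "evaluate1", "computer1", "system4"])
    document = signature(["evaluate1", "computer1", "system4"])
    print(f"similarity = {cosine(query, document):.2f}")   # 0.87 on this toy example

Both the query and the document are compared in the restricted defining vocabulary, so a document that expresses the same idea in different surface words can still score highly.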
To illustrate the utility of combining relevance judgements in different ways, we would have liked to take a collection of documents and a set of queries and have several people make relevance judgements on each query and all the documents. This is an exceedingly labour intensive task and so we have restricted the size of the experiment. We chose instead to use the existing CACM document and query test collection because it is widely known and used by many researchers. The aim of the experiment is to create relevance assessments suitable for information retrieval systems that emphasise recall, and to test the provided relevance judgements for CACM against those documents that are certainly relevant. There are 3204 documents in the CACM collection; these documents consist of titles and, in most cases, abstracts. Obtaining relevance judgements for all of these is beyond the extent of volunteer labour even for one query. In order to limit the number of documents each person would need to examine, we have followed the TREC procedure and used the pooling method of system assessment. That is, each system is run on the collection with each query; the best K documents according to each system are collected and judged as relevant or not. The precision for each system is then calculated in the usual manner, and a comparative result for the recall can be given. In other words, instead of saying that system X found 30% of the relevant documents in the collection, one can say that system X found 20% more relevant documents than system Y. In the TREC experiments the number of documents examined from each system, K, is 100 for TREC participants, and 200 for contractors. The relevance assessments are made by a single expert provided by the National Institute of Standards and Technology. In our experiments K is 80, and the judgements are made by the authors of this paper.

The information retrieval systems we compare are the semantic signature mechanism described in the previous section and a keyword mechanism. There are 64 queries in the CACM test collection, and between 80 and 160 documents per query requiring consideration, which was still too many relevance judgements to make. We set out to select approximately 10 queries that had similar performance for each system when compared using the supplied relevance judgements. We also wanted the chosen queries to be representative of the overall performance of each system. Both systems achieve around 20% overall average precision on a ten point scale, and so all queries that had an average precision of 20% ± 5% on both systems were chosen for these tests. This gives 7 queries: 6, 7, 19, 20, 25, 36, and 61. The queries themselves are shown in Figure 5.

    Query (number of relevant documents provided with SMART)
    Q6 (3): Interested in articles on robotics, motion planning, particularly the geometric and combinatorial aspects. We are not interested in the dynamics of arm motion.
    Q7 (28): I am interested in distributed algorithms: concurrent programs in which processes communicate and synchronise by using message passing. Areas of particular interest include fault-tolerance and techniques for understanding the correctness of these algorithms.
    Q19 (11): Parallel algorithms
    Q20 (21): Graph theoretic algorithms applicable to sparse matrices
    Q25 (51): Performance evaluation and modelling of computer systems
    Q36 (20): Fast algorithm for context-free language recognition or parsing
    Q61 (31): Information retrieval articles by Gerard Salton or others about clustering, bibliographic coupling, use of citations or co-citations, the vector space model, Boolean search methods using inverted files, feedback, etc.
    Figure 5: The 7 queries from CACM used in these experiments.

This selection of queries, although not a random selection,[4] provides a relatively diverse range of query styles, and a significant variation in the amount of relevant material. The results of the relevance judgements reported here and characterised below were attained by having the authors of this paper make the relevance judgements. We believe this does not introduce a bias because, using the pooled method, there was no way for the judges to know which system provided which documents.

[4] Since the selection of queries was not random the test cannot be considered a fair test of semantic signatures versus keyword retrieval; however, in this paper we are primarily concerned with the assessment mechanism, not with testing particular information retrieval mechanisms.
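A minimal sketch of the pooled assessment step as described above (document identifiers invented, K reduced for brevity; the actual experiment used K = 80 and two judges working blind):

    import random

    # Ranked output of the two systems for one query (invented identifiers).
    keyword_ranking   = ["D12", "D7", "D43", "D2", "D19", "D88", "D3"]
    signature_ranking = ["D7", "D56", "D12", "D91", "D2", "D44", "D5"]
    K = 5

    # Pool the top-K documents from each system and shuffle them, so the judge
    # cannot tell which system contributed which document.
    pool = list(set(keyword_ranking[:K]) | set(signature_ranking[:K]))
    random.shuffle(pool)

    # After (blind) judging, relative recall can be compared without knowing how
    # many relevant documents exist in the whole collection.
    judged_relevant = {"D7", "D2", "D91"}
    for name, ranking in (("keyword", keyword_ranking), ("signature", signature_ranking)):
        found = len(judged_relevant & set(ranking[:K]))
        print(f"{name}: {found} of {len(judged_relevant)} pooled relevant documents in its top {K}")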
Although Judge-1 and Judge-2 found significantly more relevant material than the SMART assessors, the overlap between Judge-1 and the SMART judge is empty for query 6, and neither new judge was lenient enough to find all documents judged relevant by the SMART assessors, even though both chose more than twice as many relevant documents. As with other non-user judges in similar tests, we considered more documents relevant than the original authors of the queries. Figure 6 provides the same information as that provided for the Lesk and Salton experiment from 1969.

    judge(s)                 Qry 6  Qry 7   Qry 19  Qry 20  Qry 25  Qry 36  Qry 61  sum
    smart                    3(3)   16(28)  8(11)   3(21)   21(51)  13(20)  21(31)  85(147)
    jdge1                    3      34      40      13      48      18      55      211
    jdge2                    7      40      38      11      61      22      47      220
    jdge1 ∩ smart            0      15      8       2       20      9       19      73
    jdge2 ∩ smart            3      16      8       2       21      11      17      78
    jdge1 ∩ jdge2            1      32      37      9       45      15      43      182
    jdge1 ∩ smart ∩ jdge2    0      15      8       2       20      9       17      71
    jdge1 ∪ smart ∪ jdge2    9      42      41      16      64      27      61      260
    Figure 6: Agreement of relevance judgements (Wallis and Thom).

The intersection of the two judges' relevance judgements is often larger than the original set of judgements, and this, once again, is presumably a product of the small number of relevance assessors participating.

We compared the semantic signature mechanism with the keyword mechanism using three different test sets of relevance judgements: the original CACM relevance judgements; the intersection of our judgements; and finally the union of our judgements. The second test set is appropriate for a comparison of the keyword and semantic signature mechanisms when precision is the emphasis, and the third test set is appropriate when recall is important. The results are presented as the number of relevant documents found after fixed numbers of viewed documents. We do not use precision at levels of recall because, using the pooled method, the performance of either system does not indicate anything about the overall number of relevant documents in the collection. We cannot, therefore, calculate actual recall.
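The measure used for the comparisons in Figures 7-9, the number of relevant documents found after N documents have been viewed, can be read directly off a ranked list and a set of judgements; a sketch with invented identifiers:

    # Relevant-found-after-N-viewed: the measure used in place of recall when the
    # total number of relevant documents in the collection is unknown.
    ranking = ["D5", "D9", "D2", "D31", "D7", "D11", "D40", "D3", "D28", "D6"]
    judged_relevant = {"D9", "D7", "D3", "D6"}

    found = 0
    for n, doc in enumerate(ranking, start=1):
        found += doc in judged_relevant
        if n in (5, 10):                       # report at a couple of cut-off points
            print(f"after {n} documents viewed: {found} relevant found")
    # after 5 documents viewed: 2 relevant found
    # after 10 documents viewed: 4 relevant found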
Figure 7 plots precision over the first 50 ranked documents for the seven queries, using the original CACM judgements, on the two systems. When the number of relevant documents is summed over the first 10 ranked documents for each system, the keyword mechanism finds one more document than the semantic signature mechanism. Over the first 50 ranked documents the performance is about the same. This is not surprising in that the parameters of the semantic signature mechanism were chosen to give the best performance using the original relevance judgements. Figure 8 shows the same comparison when the intersection of the two sets of judgements is used. This is the test aimed at high precision and, although the performance does not vary in accord with the SMART judgements, this is probably because of the bias of the assessors toward things being relevant, and the effect seen using the SMART judgements would become obvious with more relevance judges participating. Figure 9 compares the two systems for high recall. In this case the performance is about the same for the first 20 documents, but then the semantic signature mechanism starts to find more documents. By the time the user has looked at 50 documents, the user has found an average of 4 more documents per query. This represents about a 20% increase in the amount of relevant material found. If the user is interested in finding more later, rather than a few sooner, the semantic signature mechanism has a significant advantage. This performance does not appear to drop off: we have examined the ranking up to the 80 document level and the semantic signature mechanism maintains a solid 20% advantage.

We conclude that the semantic signature mechanism is worthy of further investigation in the context of high recall. This contrasts sharply with the conclusion drawn from the results based on the SMART relevance judgements. The semantic signature mechanism is ...

[Figure: number of documents viewed (0-50) against percent precision (10-70).]
متن کاملذخیره در منابع من
با ذخیره ی این منبع در منابع من، دسترسی به آن را برای استفاده های بعدی آسان تر کنید
عنوان ژورنال:
دوره شماره
صفحات -
تاریخ انتشار 1995